ABSTRACT Identifying Mycobacterium tuberculosis persistence genes is important for developing novel drugs to shorten the duration
of tuberculosis (TB) treatment. We developed computational algorithms that predict M. tuberculosis genes required for
long-term survival in mouse lungs. As the input, we used high-throughput M. tuberculosis mutant library screen data, mycobac- 
terial global transcriptional profiles in mice and macrophages, and functional interaction networks. We selected 57 unique, ge- 
netically defined mutants (18 previously tested and 39 untested) to assess the predictive power of this approach in the murine 
model of TB infection. We observed a 6-fold enrichment in the predicted set of M. tuberculosis genes required for persistence in 
mouse lungs relative to randomly selected mutant pools. Our results also allowed us to reclassify several genes as required for 
M. tuberculosis persistence in vivo. Finally, the new results implicated additional high-priority candidate genes for testing. Ex- 
perimental validation of computational predictions demonstrates the power of this systems biology approach for elucidating 
M. tuberculosis persistence genes. 

IMPORTANCE Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), has a genetic repertoire that permits it to 
persist in the face of host immune responses. Identification of such persistence genes could reveal novel drug targets and eluci- 
date mechanisms by which the organism eludes the immune system and resists drugs. Genetic screens have identified a total of 
31 persistence genes, but to date only 15% of the ~4,000 M. tuberculosis genes have been tested experimentally. In this paper, as
an alternative to brute force experimental screens, we describe computational methods that predict new persistence genes by 
combining known examples with growing databases of biological networks. Experimental testing demonstrated that these pre- 
dictions are highly accurate, validating the computational approach and providing new information about M. tuberculosis per- 
sistence in host tissues. Using the new experimental results as additional input highlights additional genes for testing. Our ap- 
proach can be extended to other data types and target organisms to characterize host-pathogen interactions relevant to this and 
other infectious diseases. 
Mycobacterium tuberculosis, the causative agent of tuberculosis
(TB), has evolved adaptive mechanisms to avoid killing by 
host immune responses. Identifying metabolic and regulatory 
pathways required for M. tuberculosis persistence in host tissues 
may elucidate novel strategies to eradicate TB infection. The avail- 
ability of the M. tuberculosis genome sequence has enabled high- 
throughput screens using subsaturated transposon (Tn) mutant 
libraries (1, 2). Such libraries have been used to study the genetic
requirements of the pathogen under physiologically relevant 
stress conditions, including during infection of macrophages (3), 
mice (4-6), guinea pigs (7, 8), and nonhuman primates (9). 

Recently, there has been substantial interest in developing 
computational algorithms for accurately predicting genes essential
for M. tuberculosis growth and survival. Flux balance analysis
uses the stoichiometry of biochemical reactions to predict growth
requirements but is limited to metabolic enzymes (10-12). Other 
approaches have enhanced flux balance analysis by including 
transcriptional profiles and regulatory relationships to constrain 
fluxes through metabolic reactions (13, 14). These approaches 
have been used to predict drug effects on M. tuberculosis mycolic 
acid biosynthesis capacity and transcription factor knockout phe- 
notypes (13, 14). Approaches to predict genetic requirements be- 
yond metabolism would have great value, particularly since only 
660 M. tuberculosis genes (~17% of the genome) are represented
in metabolic reconstructions. 

Alternative approaches described here combine actual physical 
interactions, including enzyme-substrate and protein-protein in- 
teractions, with functional associations. The resulting networks 
can be exploited to predict protein function and mutant phenotypes
(15). Simple metrics, such as shortest distance to known
genes of interest, have been used previously to predict M. tuber- 
culosis drug resistance genes (16). Graph diffusion kernels, intro- 
duced first for searching Web pages, additionally account for mul- 
tiple independent network paths and improve performance. 
Successes have included predicting epistatic genetic interactions 
in yeast (17, 18), predicting protein function through protein- 
protein interactions (19), and identifying candidate genes for dis- 
ease (20, 21). Biological networks with different interaction types 
can provide complementary information, and integrative ap- 
proaches modeling biological functions have been used to predict 
protein-protein interactions (22, 23), synthetic lethal interactions 
(17), co-complexed pairs (24), and driver missense mutations 
(25). In this study, we combined known M. tuberculosis persis- 
tence genes and transcriptional profiles with networks from met- 
abolic reconstructions and functional associations to make 
genome-wide predictions of genes required for mycobacterial 
persistence in the host (26, 27). The top-ranked predictions were 
then tested experimentally to confirm their accuracy. Further, we 
developed new computational algorithms, incorporating recently 
published data sets (28-31), which together with our new experi- 
mental results highlight additional genes for testing. This study 
extends our knowledge of M. tuberculosis persistence and identi- 
fies potential novel drug targets, with the ultimate goal of short- 
ening the duration of TB treatment. This systems biology ap- 
proach, combining computational predictions with experimental 
validation, is general and readily extended to new data types and 
other target organisms, including host-pathogen interactions rel- 
evant to this and other infectious diseases. 

RESULTS 

Computational predictions. Computational predictions (see 
Data Sets S1 and S2 in the supplemental material) were used to
prioritize mutants for experimental tests in mice (Fig. 1; see 
Data Set S3A and B). The predictions propagated gene-based phe- 
notypes (Table 1), including known persistence defects and addi- 
tional informative phenotypes, through M. tuberculosis gene net- 
works to generate gene-based features for predicting additional 
persistence mutants with logistic regression (see Data Set S3C). 

Known in vivo persistence genes were derived from a Tn mu- 
tant screen using designer arrays for defined mutant analysis 
(DeADMAn) (5). This screen identified 31 persistence genes and 
474 genes not required for persistence in mouse lungs. These 
genes served as known positives and negatives, respectively. Addi- 
tional relevant gene data sets included Tn site hybridization 
(TraSH) data derived from mouse spleen (6) and murine macro- 
phages (3). Genes required for in vitro growth were obtained from 
Tn mutagenesis screens (1, 2). Genes differentially expressed during
infection were obtained from transcription profiling studies
(32, 33). 

Networks of functional associations were obtained from pub- 
licly available metabolic reconstructions (11) and data integration 
approaches (27). A steady-state graph diffusion kernel propagated 
the gene data (persistence genes, essential genes, and differentially 
expressed genes) through the networks to create features for logis- 
tic regression and support vector machine classifiers (see Data Set 
S1). The full logistic regression model included all 28 features;
stepwise selection with the Akaike information criterion (AIC) 
eliminated redundant and uninformative features. Twentyfold
cross-validation was used to assess performance based on the
known positives and negatives, with area under the receiver operating
characteristic curve (AUROC) and the maximum harmonic mean of precision
and recall (F score) serving as quantitative criteria. Ten
different random 20-way splits were performed to ensure robust 
results. 
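
The propagation step described above can be illustrated with a minimal sketch. The exact steady-state kernel is specified in Data Set S1 and is not reproduced here; the version below assumes a random-walk-with-restart formulation, and the toy network, the restart parameter `alpha`, and the function name `diffuse` are all hypothetical.

```python
def diffuse(edges, seeds, alpha=0.5, tol=1e-10, max_iter=1000):
    """Propagate seed scores over a weighted undirected network.

    Assumed formulation (random walk with restart), iterated to steady state:
        f <- alpha * seeds + (1 - alpha) * (weighted mean of neighbor scores)
    """
    # Build a symmetric adjacency map: node -> {neighbor: edge weight}.
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    for g in seeds:
        adj.setdefault(g, {})
    nodes = sorted(adj)
    f = {n: seeds.get(n, 0.0) for n in nodes}
    for _ in range(max_iter):
        new = {}
        for n in nodes:
            total = sum(adj[n].values())
            spread = sum(w * f[m] for m, w in adj[n].items()) / total if total else 0.0
            new[n] = alpha * seeds.get(n, 0.0) + (1 - alpha) * spread
        delta = max(abs(new[n] - f[n]) for n in nodes)
        f = new
        if delta < tol:
            break
    return f


# Toy chain A - B - C - D in which only A is a known persistence gene (+1).
edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "D"): 1.0}
seeds = {"A": 1.0}
scores = diffuse(edges, seeds)
```

In this toy example the propagated score decays with network distance from the seed gene, which is the property that lets untested genes inherit evidence from nearby known persistence genes.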

Stepwise logistic regression and full logistic regression were 
equivalent, and both regression methods were superior to support 
vector machines (Fig. 2). The F score for all methods is maximal 
near 20 to 30% recall. Stepwise regression at 20% recall is predicted
to have a mean precision of ~50%, an approximately 8-fold
enrichment compared to the overall estimate of in vivo persistence 
genes within the entire genome (6%) (5). Stepwise logistic regres- 
sion was chosen as the most parsimonious model and used to 
predict genome-wide persistence requirements based on the 11
features selected for the full data (Table 2). Known positives and 
negatives ranked by cross-validation provided empirical estimates 
of precision and recall as a function of ranking. Predicted values 
are provided genome-wide (see Data Set S2). 
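
The two performance criteria used here can be computed directly from ranked cross-validation scores. The sketch below is illustrative only: the scores and labels are made-up toy values, and the helper names `auroc` and `max_f_score` are hypothetical rather than part of the supplemental software.

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def max_f_score(scores, labels):
    """Maximum over rank cutoffs of the harmonic mean of precision and recall.

    Assumes no tied scores, so every rank position is a valid cutoff.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    best, tp = 0.0, 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        if tp:
            precision, recall = tp / k, tp / n_pos
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy predictions: 1 = known persistence gene, 0 = known negative.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 0, 0, 0]
print(auroc(toy_scores, toy_labels))        # 0.875
print(round(max_f_score(toy_scores, toy_labels), 3))  # 0.8
```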

Gene selection for experimental verification. The top 75 
computationally predicted genes were selected in rank order, in 
addition to the positive and negative controls, pknF (Rv1746) and
Rv1863c (9), respectively, yielding 77 candidate genes. Of these 77
genes, 7 had unfavorable rankings as the prediction method was 
being developed, and 1 known positive was not selected for test- 
ing, leaving 69 genes selected for testing. Of the 70 corresponding 
mutant strains, 7 failed to grow sufficiently in vitro, yielding 63 
M. tuberculosis Tn mutants corresponding to 62 unique genes in 
the infection pool. 

Experimental verification in the murine model of TB infection.
On the day after aerosol infection of BALB/c mice, the implantation
dose was determined to be 2.71 ± 0.01 log10 bacilli. The
output time point of 14 weeks was selected to evaluate mutant 
persistence in mouse lungs for consistency with previous studies 
used for statistical modeling (5). In addition, earlier (day 49) and
later (day 196) time points were included to permit a kinetic analysis of
individual mutant survival. 

Total lung bacillary counts increased and mice gained weight 
as expected (see Data Set S3A and B). Gross examination of mouse 
lungs 49 days postinfection and beyond revealed discrete tubercle 
lesions. Histological evaluation showed cellular aggregates com- 
prising primarily lymphocytes, with few histiocytes and plasma 
cells. Acid-fast bacilli were localized primarily within foamy mac- 
rophages (data not shown). 

The ability of each mutant to survive in the host was ascertained
by quantitative real-time PCR (qPCR). Of the 63 mutants used, 5 (the
Rv0099, Rv0101, Rv1183, Rv1821, and Rv3823c mutants) repeatedly failed
to amplify with their PCR primers and were removed from further analysis.
Data were available for a total of 58 mutants corresponding to 57 unique genes,
including 6 known positives previously characterized as having a 
persistence phenotype, 12 known negatives previously character- 
ized as not required for persistence in mouse lung, and 39 mutants 
previously uncharacterized by DeADMAn. The mean predicted 
precision was 32%. 

FIG 1 Overview of study design. Phenotypes from previous studies of M. tuberculosis persistence in mouse lungs were combined with high-throughput data and
functional and metabolic networks to predict new candidate genes for experimental testing. Mutants corresponding to the top-ranked genes were grown, pooled,
and used for aerosol infection of mice. Mutants were recovered from lungs at 1, 49, 98, and 196 days postinfection, and abundance for 57 mutants was
characterized by qPCR. Statistical models identified 23 of the 57 mutants as attenuated, including 18 novel genes, representing a 6-fold enrichment over the
fraction required for persistence genome-wide.

Mutants with a wild-type (null) phenotype showed no change in representation
over time. On the other hand, attenuated mutants showed an
increase in cycle threshold (CT) number over time, and "hypervirulent"
mutants showed a decreasing CT over time, indicating a
population fraction increase. Mutants meeting a multiple-testing-corrected
P value threshold of 0.05 were classified as either attenuated or
virulent; both replicates of Rv0169 had concordant null phenotypes.
Of the 57 unique genes tested, 23 were found to be attenuated,
3 virulent, and 31 null (Table 3). Roughly equivalent results
are obtained using a threshold of 95% posterior probability for a
mutant to belong to the attenuated class. These thresholds correspond
to a change of about 1 CT unit between measurements or an
average change of 3 CT units (~8-fold attenuation) from the first
to the last of the 4 time points.
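
The correspondence between a CT shift and fold attenuation follows from the roughly twofold amplification per PCR cycle. A brief sketch, assuming ideal amplification efficiency (exactly 2-fold per cycle, which real assays only approximate); the function name and the `efficiency` parameter are hypothetical:

```python
def fold_change_from_delta_ct(delta_ct, efficiency=2.0):
    """Fold reduction in template implied by an increase of delta_ct cycles."""
    return efficiency ** delta_ct


print(fold_change_from_delta_ct(1))  # 2.0
print(fold_change_from_delta_ct(3))  # 8.0
```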

Statistical assessment of performance on known genes and 
novel predictions. Of the 6 known positives that were tested, 5 
gave growth defects in this test. The single known positive with no 
growth defect was lldD2 (Rv1872c). However, the previously studied
Rv1872c mutant was in a sigF deletion background (5), perhaps
accounting for the persistence phenotype. Of the 12 known
negatives that were tested, 8 remained negative. Four, however,
were attenuated: atsD (Rv0663), hrcA (Rv2374c), fadA6 (Rv3556c),
and Rv3870. All four have been tested previously in related TraSH
studies, and all but Rv0663 were required for growth in mouse
spleen (2). The overall concordance for previously characterized
mutants is at least (5 + 8)/(6 + 12), or 72%, and may be closer to
(5 + 11)/(5 + 12), or 94%.
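
The two concordance figures are simple ratios of the counts given above. The short check below reproduces them; the variable names are illustrative only.

```python
# Counts taken from the text: known positives and negatives retested here.
known_pos_tested, known_pos_confirmed = 6, 5
known_neg_tested, known_neg_confirmed = 12, 8

# Conservative estimate: every discordant result counts against concordance.
lower = (known_pos_confirmed + known_neg_confirmed) / (known_pos_tested + known_neg_tested)

# Lenient estimate, using the alternative counts stated in the text.
upper = (5 + 11) / (5 + 12)

print(f"{lower:.0%} {upper:.0%}")  # 72% 94%
```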

TABLE 1 Sources of data input into computational models

Source | Description | Edge wt | Gene wt
-------|-------------|---------|--------
M. tuberculosis networks | | |
  STRING functional associations (27) | 3,964 nodes, 496,278 edges | Combined score ∈ [0, 1] |
  BiGG metabolic reconstruction (11) | 661 nodes, 217,470 edges | Poisson score mapped to [0, 1] |
M. tuberculosis essential genes | | |
  Transposon mutants (1) | 3,795 genes, Gibbs sampling posterior probability | | Pr(essential) ∈ [0, 1]
  TraSH (2) | 3,172 genes | | Log(input/output)
M. tuberculosis persistence genes | | |
  DeADMAn in mouse (5) | 31 persistence genes, 474 nonpersistence genes | | +1 (persistence), −1 (nonpersistence), 0 (untested)
  TraSH in mouse (6) | 2,967 genes, measured 8 weeks after infection | | Log(input/output)
  TraSH in mouse macrophage (3) | 2,859 genes, unactivated macrophage | | Log(input/output)
  TraSH in mouse macrophage (3) | 2,859 genes, activated with IFN-γ before infection | | Log(input/output)
  TraSH in mouse macrophage (3) | 2,859 genes, activated with IFN-γ after infection | | Log(input/output)
M. tuberculosis differentially expressed genes | | |
  Mouse infection (33) | Weeks 1, 2, 4, and 8 after infection | | Log(input/output)
  Macrophage infection (32) | Hours 4 and 24 after infection | | Log(input/output)

Of the 39 unique novel genes tested, 22 had no persistence
defect and 17 were found to have a non-wild-type phenotype, 14
with persistence defects and 3 with increased growth relative
to the wild type (Table 3). The attenuation ranged from 8-fold (the
lower limit for statistical significance) to over 100,000-fold (the
dynamic range of qPCR) (Fig. 3). Counting only the attenuated
strains as correct predictions, this 14/39 or 36% success rate is
close to the 32% success rate predicted by the statistical model and
represents a 6-fold enrichment over the 6% estimate of in vivo
persistence genes (5).
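
The enrichment figure can be checked with a few lines of arithmetic. This is a sketch using only the counts and the 6% genome-wide estimate quoted in the text; the variable names are illustrative.

```python
attenuated_novel, tested_novel = 14, 39
baseline = 0.06  # genome-wide estimate of in vivo persistence genes (5)

success_rate = attenuated_novel / tested_novel
enrichment = success_rate / baseline

print(f"{success_rate:.0%}, {enrichment:.1f}-fold")  # 36%, 6.0-fold
```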

The 23 genes required for persistence in mouse lungs in this 
assay include 5 that were previously known to be required and 18 
novel genes that were either not tested or likely false negatives in 
previous mouse lung screens (Table 3). 

Concordance of experimental model systems. This and a previous
study (5) used medium-throughput assays to test 545 genotypically
characterized mutants for persistence in mouse lungs
following aerosol infection (see Data Set S4A to E). Similar mutants
have also been tested as part of high-throughput, complex
libraries using TraSH to study bacillary survival in macrophages 
and in mouse spleen following intravenous infection (3, 6). Of the 
459 genes tested by all three systems, 76 of the corresponding 
mutants have a defect in at least one of the three systems: 8 are 
attenuated in all three systems, and an additional 18 are attenuated 
in two of the three systems (see Data Set S4F). 

FIG 2 Statistical assessment of prediction methods. Predictions using logistic regression with stepwise selection by AIC (solid, green), logistic regression with
a full model (dashed, orange), and a support vector machine (solid, red) are assessed by receiver operating characteristic (A) and precision-recall using 20-fold
cross-validation (B). Logistic regression with a full model or stepwise selection provide equivalent performance and are superior to the support vector machine.

TABLE 2 Stepwise logistic regression model

Feature(a) | Coefficient | P value
-----------|-------------|--------
Intercept | −17.55 ± 1,067.87 | 9.86 × 10^-1
GDK (STRING, DeADMAn in mouse) | 8.37 ± 2.21 | 1.57 × 10^-4
GDK (STRING, TraSH in mouse macrophage after IFN-γ) | 0.66 ± 0.37 | 7.47 × 10^-2
GDK (STRING, TraSH in mouse) | 1.62 ± 0.42 | 1.14 × 10^-4
GDK (STRING, TraSH essential genes) | 0.40 ± 0.23 | 8.95 × 10^-2
GDK (metabolic, DeADMAn in mouse) | −110.06 ± 78.75 | 1.62 × 10^-1
GDK (metabolic, TraSH in mouse macrophage unactivated) | −1.35 ± 0.76 | 7.62 × 10^-2
GDK (metabolic, transposon mutants) | 1,232.09 ± 793.44 | 1.20 × 10^-1
Mouse infection day 14 | −0.63 ± 0.42 | 1.39 × 10^-1
Mouse infection day 21 | 0.65 ± 0.22 | 3.42 × 10^-3
Indicator (mouse infection day 7) | −3.04 ± 1.44 | 3.50 × 10^-2
Indicator (mouse infection day 14) | 17.82 ± 1,067.88 | 9.86 × 10^-1

(a) GDK (network, gene data) indicates features from a graph diffusion kernel with the given network and gene data.

All pairwise comparisons of mutant phenotypes with 2-by-2
contingencies are highly significant (see Data Set S4G and H). It
does appear, however, that the in vivo DeADMAn system is more
similar to the corresponding in vivo TraSH mouse system (odds
ratio of 13.3, Fisher's exact one-sided P value of 1.3 × 10^-10) than
to TraSH in macrophages (odds ratio of 7.4, P value of 3.4 ×
10^-5). The two TraSH systems are also significantly correlated
(odds ratio of 14.2, P value of 5.2 × 10^-9). Of genes attenuated by
TraSH overall, 37% are also required for persistence in mouse
lungs, similar to the predictive performance of the statistical
model. It is important to note, however, that this study identified
12 of the genes attenuated in both. Prior to this study, only 24% of
genes attenuated by TraSH were also found to be attenuated using
DeADMAn. Furthermore, of the mutants tested across all three
systems, distinct sets are attenuated in only a single system: 20 are
unique to DeADMAn, 18 are unique to TraSH in mice, and 12 are
unique to TraSH in macrophages. These results suggest corresponding
distinct mechanisms. The number of mutants unique to
TraSH in macrophages is smallest, possibly because macrophage
infection is common to all three systems.
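
The pairwise comparisons rest on 2-by-2 contingency tables of attenuated versus nonattenuated calls in two systems. The following sketch shows how an odds ratio and a one-sided Fisher's exact P value can be obtained from such a table. The counts are a hypothetical toy example, not values from Data Set S4, and the function name is an assumption. The odds ratio computed here is the simple sample odds ratio; some statistical packages report a conditional maximum-likelihood estimate instead, which can differ slightly.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """Sample odds ratio and one-sided (greater) Fisher's exact P value.

    Table layout: [[a, b], [c, d]], where a counts mutants attenuated in
    both systems and d counts mutants attenuated in neither.
    """
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    row1, col1, n = a + b, a + c, a + b + c + d
    # Sum hypergeometric probabilities over tables at least as extreme
    # (top-left cell >= a) with the same row and column totals.
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k)
                    for k in range(a, min(row1, col1) + 1))
    return odds_ratio, numerator / comb(n, col1)


odds, p_value = fisher_one_sided(3, 1, 1, 3)
print(odds, round(p_value, 4))  # 9.0 0.2429
```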

Predictions with updated external data and new results from 
this study. We investigated (see Data Sets S4 to S6) whether recently
reported external data improved our predictions (28-31).
Incorporating four new external data sets with improved annota- 
tion of essential genes did not improve the predictions: the area 
under the curve (AUC) remained close to 0.69 and the F score
remained close to 0.30 (see Data Set S4A and B). We also updated
the predictions by including the new experimental results of this 
study, which update gene labels from "untested" to either "atten- 
uated" or "null," together with the four new external data sets 
(Fig. 4). In the three cases where the new experimental results 
conflicted with previous results (lprK [Rv0173], lldD2 [Rv1872c],
tig [Rv2462c]), we used the new results for cross-validation tests. 
Here, the prediction performance improved substantially, with a 
new AUC of 0.77 and a new F score of 0.42 (see Data Set S4C and 
D). Three genes are particularly noteworthy in rising substantially 
in priority and also having mutants available for testing: Rv1410c,
fadD21 (Rv1185c), and pheA (Rv3838c).

DISCUSSION 

Although many studies have highlighted the importance of vari- 
ous adaptive mechanisms in promoting the long-term persistence 
of M. tuberculosis in host tissues, the M. tuberculosis molecular 
pathways underlying long-term survival in the infected host re- 
main largely undefined (34-36). This information is not only im- 
portant for improving our understanding of TB pathogenesis but 
could also serve as the basis for the rational development of novel
sterilizing drugs to shorten the duration of TB chemotherapy. The
computational methods developed here provide a genome-scale
ranking of bacterial mutants by predicting persistence phenotypes.
The predictions are then validated by medium-scale tests of
tens to hundreds of mutants in a mouse model. Using this approach,
we observed a 6-fold enrichment in the predicted set of
M. tuberculosis genes required for persistence in mouse lungs relative
to randomly selected mutant pools.

TABLE 3 Experimental results of Tn mutant survival in mice and comparison with prior high-throughput studies

Gene(s) | Count | Result: this screen | Result: DeADMAn (5) | Result: TraSH (3, 6)
--------|-------|---------------------|---------------------|---------------------
mkl (Rv0655) | 1 | Attenuated | Attenuated | Attenuated
mmpL11 (Rv0202c), fadD26 (Rv2930) | 2 | Attenuated | Attenuated | Null
mmpL4 (Rv0450c), pknF (Rv1746) | 2 | Attenuated | Attenuated | Untested
hrcA (Rv2374c), fadA6 (Rv3556c), Rv3870 | 3 | Attenuated | Null | Attenuated
atsD (Rv0663) | 1 | Attenuated | Null | Null
pks16 (Rv1013), Rv1045, lprG (Rv1411c), bioA (Rv1568), aceE (Rv2241), cpsA (Rv3484), Rv3683, Rv3723, Rv3871 | 9 | Attenuated | Untested | Attenuated
pntB (Rv0157), Rv1226c, Rv1591, mez (Rv2332), tig (Rv2462c) | 5 | Attenuated | Untested | Null
lldD2 (Rv1872c) | 1 | Null | Attenuated | Null
mce1A (Rv0169), Rv2707 | 2 | Null | Null | Attenuated
mmpL6 (Rv1557), Rv1863c, Rv2674, ppsE (Rv2935), pks1 (Rv2946c) | 5 | Null | Null | Null
fadD28 (Rv2941) | 1 | Null | Null | Untested
lprK (Rv0173), Rv0176, pcaA (Rv0470c), lpqY (Rv1235), ppgK (Rv2702), drrA (Rv2936), Rv3236c, Rv3616c, Rv3910 | 9 | Null | Untested | Attenuated
gca (Rv0112), Rv0203, Rv0660c, Rv0662c, fabG (Rv2766c), dinF (Rv2836c), ppsC (Rv2933), Rv3253c, aspB (Rv3565) | 9 | Null | Untested | Null
ppsA (Rv2931), ppsB (Rv2932), ppsD (Rv2934), Rv3273 | 4 | Null | Untested | Untested
rodA (Rv0017c) | 1 | Virulent | Untested | Attenuated
gabD2 (Rv1731), pgsA2 (Rv1822) | 2 | Virulent | Untested | Null

FIG 3 M. tuberculosis Tn mutant survival, as assessed by qPCR. Genes are sorted in decreasing order of β_g (blue line), the regression fit of the change in ΔCT over
3 time intervals; large positive values correspond to attenuated mutants (green), and large negative values correspond to virulent mutants (red). The ΔCT values
at day 49 (dotted line), day 98 (dashed line), and day 196 (solid line) are shown relative to the day 1 baseline.

We identified 18 genes that were not previously characterized
as required for M. tuberculosis persistence in animal lungs. Of these genes,
Rv1013, Rv1411c, Rv2374c, Rv2462c, Rv3484, Rv3556c, Rv3683,
Rv3870, and Rv3871 were found to be significantly differentially 
expressed during nutrient deprivation of M. tuberculosis (37, 38), 
consistent with the hypothesis that the encoded products are in- 
volved in adaptation of M. tuberculosis to the nutrient-deprived 
environment of mouse lungs during chronic infection. The novel 
persistence genes Rv1226c, Rv2462c, Rv3556c, Rv3683, and Rv3723
were shown to be significantly differentially regulated by M. tuber- 
culosis upon inorganic phosphate limitation, suggesting that the 
cognate products may contribute to bacillary survival within the 
phosphate-starved environment of the macrophage phagolyso- 
some during chronic infection (3, 39). These genes represent po- 
tential novel drug targets but require further validation in individ- 
ual infections. 

The M. tuberculosis genome contains a number of genes be- 
longing to the family of polyketide synthases (PKSs), which cata- 
lyze the formation of polyketide secondary metabolites (40). The 
PKSs are structurally and mechanistically related to the fatty acid 
synthases (FASs), which are involved in the biosynthesis of fatty 
acids. Recent reports suggest that proteins encoded by the three-operon
fadD26-mmpL7 locus (fadD26 ppsA-ppsE, drrA-drrC,
papA5 mas fadD28 mmpL7) play major roles in phthiocerol dimycocerosate
(PDIM) biosynthetic and transport pathways, which
are required for virulence (41-44). Out of 13 genes in this locus,
we tested 7 genes in the current study: fadD26, a known positive,
and ppsA-ppsE and drrA, all previously untested in mouse lungs, 
except for the known negative ppsE. While attenuation of the 
fadD26 mutant was confirmed, none of the remaining genes was 
required for persistence in mouse lungs. Although the drrA and 
drrB genes are required for macrophage infection (3), our data 
suggest that they are not required for M. tuberculosis survival in 
mouse lungs. 

The PKS genes pks1, pks10 (45), and pks7 (46), which are involved
in dimycocerosyl phthiocerol synthesis, were reported to
be required for M. tuberculosis persistence in mice (45, 46). In the
current study, a pks16-deficient mutant showed reduced persistence
in mouse lungs, while the pks1-deficient mutant showed no
survival defect. The discrepancy between our findings and those of
Sirakova et al. may be due to the different strains of mice (BALB/c
and C57BL/6J, respectively), different routes of infection (aerosol
and intranasal, respectively), different inoculating doses (10^2 and
10^4 CFU, respectively), or different model systems (pooled and individual
infection, respectively) (45). It is unlikely that Pks1 function was
retained in our mutant, since the Tn insertion
is at 2,869 bp (total gene length = 4,863 bp). Although pks7 was
previously reported to be an essential gene (2), our data are con- 
sistent with other studies demonstrating that the gene is dispens- 
able for in vitro growth but essential for M. tuberculosis survival in 
mice (4). 

Of the 12 M. tuberculosis genes designated mycobacterial membrane
protein large (mmpL1 to mmpL12), we studied three
(mmpL4, mmpL11, mmpL6) and confirmed the results of earlier
high-throughput screens demonstrating that the first two genes
are required for long-term bacillary survival in mouse lungs (5,
41). MmpL4 and MmpL11 are predicted to serve as lipid transporters
and have been shown to have a role in M. tuberculosis
virulence in mice (47).

FIG 4 Predictions of probabilities of persistence defects for deletion mutants, including new results from this study (y axis), are compared with original
predictions at the start of this study (x axis). Colors indicate previously known and new positive attenuated mutants (solid and open green), previously known
and new negative nonattenuated mutants (solid and open red), untested mutants available for testing (black circles), and mutants unavailable for testing because
they are essential (solid gray) or otherwise unavailable (open gray).

The genes Rv3870 and Rv3871, which together with Rv3877
encode cytosolic or membrane-bound components of the ESX-1
secretion machinery, were found to be required for persistence in 
mouse lungs in the current study. Our findings are consistent with 
prior studies demonstrating a requirement for Rv3871 in M. tu- 
berculosis survival in murine macrophages (3) and lungs (2), as 
well as in nonhuman primate lungs (9). Together, these results 
indicate the central role for the ESX secretion pathway in M. tu- 
berculosis virulence (48). 

Interestingly, four mutants (Rv0017c::Tn, Rv0112::Tn,
Rv1731::Tn, and Rv1822::Tn) were more abundant in the mouse
lungs at days 98 and 196 relative to day 49. Data for two mutants
(Rv0017c::Tn and Rv0112::Tn) appear to conflict with earlier
TraSH-based studies reporting that Rv0112 is an essential gene (2)
and that Rv0017c is required for M. tuberculosis survival in primary
murine macrophages (3). Since the Tn insertion in our
Rv0112::Tn mutant is at base pair position 91 (total gene
length = 957 bp), gene function is expected to be disrupted, indicating
that it is, in fact, a nonessential gene (1). The discrepancy between
our findings and those of Rengarajan et al. (3) regarding Rv0017c,
which encodes a probable cell division protein RodA, may be due 
to differences in models (mouse versus macrophages) or tech- 
niques used to assess mutant growth and survival (qPCR versus 
microarrays). 

The current study demonstrates that a network-based compu- 
tational approach integrating diverse high-throughput data sets 
may be used to predict genes essential for M. tuberculosis persis- 
tence in mouse lungs. These computational predictive algorithms 
can be further improved by iterative refinement through active 
learning or by including data from additional relevant model sys- 
tems, M. tuberculosis regulatory networks (49), and operon struc- 
ture. To test this hypothesis, we updated the external data by in- 
cluding four new essential gene data sets and updated the training 
data by using the new experimental results from this study. The
new experimental results highlighted three additional genes as 
high-priority candidates for testing. Additional rounds of experi- 
mentation and modeling could therefore lead to even greater 
knowledge of the genetic requirements for M. tuberculosis persis- 
tence. We believe future work should focus on the development of 
small molecule inhibitors of the most promising candidates iden- 
tified through such systems biology-based approaches, with the 
ultimate goal of shortening the duration of TB chemotherapy. 

MATERIALS AND METHODS 

Network data. A functional association network for the M. tuberculosis 
H37Rv strain was obtained from the STRING database (27). A metabolic 
reconstruction for H37Rv (11) was converted to a functional association 
network using the log-likelihood ratio p for shared metabolites (18) and 
then mapped to the weight: 1/(1 + e^{2 - p}). Protein-protein interactions
from yeast two-hybrid screens (50) are included in STRING and did not 
improve performance when used as a separate feature. 
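
The weight mapping for the metabolic network is a logistic transform of the log-likelihood ratio. The sketch below assumes the mapping reads 1/(1 + e^(2 - p)), so that p = 2 corresponds to a weight of 0.5; the function name `edge_weight` is hypothetical, and very large negative values of p are not guarded against overflow.

```python
import math


def edge_weight(p):
    """Map a log-likelihood ratio p to a weight in (0, 1): 1 / (1 + e^(2 - p))."""
    return 1.0 / (1.0 + math.exp(2.0 - p))


for p in (0.0, 2.0, 4.0):
    print(p, round(edge_weight(p), 3))
# 0.0 0.119
# 2.0 0.5
# 4.0 0.881
```

Under this reading, gene pairs with weak evidence for shared metabolites receive weights near 0 and strongly supported pairs approach 1, which puts the metabolic edges on the same [0, 1] scale as the STRING combined score.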

Essential genes in vitro. Probabilities that genes are essential for 
M. tuberculosis growth in nutrient-rich broth were compiled from two 
random mutagenesis studies and Gibbs sampling with mutant survival 
data (1, 2, 51). Probabilities were recalculated using the "negenes" 
R-package (http://www.biostat.wisc.edu/~kbroman/software/) from cur- 
rent data available from the Tuberculosis Animal Research and Gene 
Evaluation Taskforce (TARGET) (http://webhost.nts.jhu.edu/target/). 

Persistence genes in vivo. Genes required for M. tuberculosis survival 
in mouse tissues (persistence genes) were obtained from two previous 
studies (5, 6). In addition, data were extracted from a Tn mutant study in 
macrophages derived from C57BL/6J bone marrow with and without 
gamma interferon (IFN-γ) activation (3). Persistence genes from M. tuberculosis 
strain CDC 1551 were mapped to H37Rv orthologs from TubercuList (52). 
Scores $s_g$ were log(output pool/input pool) for each gene $g$, 
and $s_g = 0$ for untested genes. The 8-week time point from the Sassetti et 
al. study (6) was selected as the closest match to the 49-day time point in 
the Lamichhane et al. study (5). Class totals for each study were $S_+$ and $S_-$, 
and normalized weights $w_g$ were $s_g S_{tot}/S_{\pm}$ for $\pm s_g > 0$, with $S_{tot} = S_+ + S_-$. 
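
The normalization of the persistence scores can be illustrated with a short Python sketch. It assumes, since the text does not spell this out, that the class totals are the summed magnitudes of the positive and negative scores, and that the logarithm is natural. The function name and gene labels are hypothetical.

```python
import math

def normalized_weights(scores):
    """Rescale per-gene scores s_g so both classes carry equal total weight.

    scores: dict mapping gene -> s_g = log(output pool / input pool),
            with 0 for untested genes.
    Assumption: S_plus and S_minus are the sums of |s_g| over genes with
    positive and negative scores, and S_tot = S_plus + S_minus.
    """
    s_plus = sum(s for s in scores.values() if s > 0)
    s_minus = sum(-s for s in scores.values() if s < 0)
    s_tot = s_plus + s_minus
    weights = {}
    for gene, s in scores.items():
        if s > 0:
            weights[gene] = s * s_tot / s_plus
        elif s < 0:
            weights[gene] = s * s_tot / s_minus
        else:
            weights[gene] = 0.0  # untested genes carry no weight
    return weights

# Hypothetical output/input ratios for four genes; geneD is untested.
example_scores = {
    "geneA": math.log(0.25),
    "geneB": math.log(0.5),
    "geneC": math.log(2.0),
    "geneD": 0.0,
}
example_w = normalized_weights(example_scores)
```

With this choice of class totals, the positive and negative weights each sum in magnitude to the same total, so a study with few depleted mutants is not swamped by its many unaffected ones.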

Transcriptional profiling. Transcriptional data of M. tuberculosis 
H37Rv during infection of mouse lungs and bone marrow-derived mac- 
rophages were obtained from the TB database (53). Features, defined as 
positive or negative weights w g for each gene g, were the log ratios of the 
transcriptional profiles obtained at 1, 2, 3, and 4 weeks (33) or 4 and 24 h 
(32) postinfection. 

Features from graph diffusion kernels. Please see Data Set S1 for a 
detailed description. 
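
For readers unfamiliar with graph diffusion, the sketch below shows one common form, the heat kernel exp(-beta L) of the graph Laplacian L = D - A, applied to a vector of per-gene weights. This is an assumption made for illustration; the kernel and parameters actually used are specified in the supplemental material. The function name, the value of beta, and the toy network are all hypothetical.

```python
def diffuse(adjacency, features, beta=0.5, terms=30):
    """Apply the heat kernel exp(-beta * L) to a feature vector.

    adjacency: symmetric weighted adjacency matrix (list of lists).
    features:  per-gene weights to propagate through the network.
    The kernel is evaluated with a truncated Taylor series, which is
    adequate for small graphs and moderate beta (an illustrative choice,
    not an efficient implementation).
    """
    n = len(features)
    degree = [sum(row) for row in adjacency]

    def laplacian_times(v):
        # (L v)_i = degree_i * v_i - sum_j A_ij * v_j
        return [
            degree[i] * v[i] - sum(adjacency[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]

    result = list(features)
    term = list(features)
    for k in range(1, terms + 1):
        # term_k = (-beta * L / k) applied to term_{k-1}
        term = [-beta * x / k for x in laplacian_times(term)]
        result = [r + t for r, t in zip(result, term)]
    return result

# Toy network: a three-gene path 0 - 1 - 2, with signal on gene 0 only.
toy_adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_diffused = diffuse(toy_adjacency, [1.0, 0.0, 0.0])
```

In this toy case the signal spreads from gene 0 to its neighbors with decreasing strength, which is how evidence for one gene can raise the score of functionally linked genes.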

Classification and cross-validation performance assessment. Please 
see Data Set S1 for a detailed description. Software and data sets are available 
in the supplemental material (see Data Set S1 and reference 54). 

Mutant pool generation for experimental studies. A library of 5,126 
unique transposon (Tn) insertion mutants in 2,246 unique genes in CDC 
1551 was generated previously (1). The top 75 genes with Tn mutants 
available were considered in rank order, and 67 were selected for testing. A 
positive control, JHU1746-380, an in vivo persistence mutant containing a 
Tn insertion in gene Rv1746/MT1788, and a negative control, JHU1863c-275, 
a fully virulent mutant containing a Tn insertion in gene Rv1863c/MT1912, 
were also added to the pool. Mutants JHU0169-511 and JHU0169-573 
were internal controls with Tn insertions in the same gene but at 
different positions (511 bp and 573 bp, respectively). Each mutant was grown 
individually at 37°C in supplemented Middlebrook 7H9 medium (Difco) 
containing 20 μg/ml kanamycin (Sigma) to mid-log phase (optical density 
at 600 nm [OD600] of ~0.6). The 63 different mutants in 62 unique 
genes were pooled by combining an equal volume of each strain. 



Mouse infection. All procedures involving animals were performed in 
compliance with the U.S. Animal Welfare Act regulations and Public 
Health Service Policy according to protocols approved by the Institutional 
Animal Care and Use Committee at Johns Hopkins University. All mice 
were maintained and bred under specific-pathogen-free conditions and 
fed water and chow ad libitum. Female BALB/c mice (5 to 6 weeks old; 
Charles River) were infected via the aerosol route using an inhalation 
exposure system (Glas-Col) with 2 log10 bacilli. Five mice per group were 
sacrificed at days 1, 49, 98, and 196 postinfection. Both lungs were homogenized 
in phosphate-buffered saline (PBS), plated on supplemented 
Middlebrook 7H10 solid medium (Difco) containing 20 μg of kanamycin/ml, 
and incubated at 37°C for at least 3 weeks before colony enumeration 
or DNA extraction. 

Real-time PCR. For each time point, approximately 1,000 colonies 
were scraped and pooled, and genomic DNA (gDNA) was prepared (4, 5, 
7). The gDNA preparations from each experimental group were pooled, 
and qPCR was performed in duplicate using iCycler iQ (version 3.1.7050; 
Bio-Rad). Mutant-specific primer sets, each composed of a generic Tn 
primer and a gene-specific primer, were designed to amplify 150- to 
200-bp DNA fragments and validated by amplifying the correct-sized 
fragment by conventional PCR. For a given qPCR run, the cycle threshold 
($C_T$) for Tn mutant $g$ is $C_T(g)$ and for the housekeeping gene sigA is $C_T(h)$. 
The difference $C_T(g) - C_T(h)$ is $\Delta C_T(gtr)$, where $g$ labels the mutant, $t$ 
labels the four time points (day 1, 49, 98, or 196), and $r$ labels the technical 
replicate (1 or 2). Finally, $y_{gt}$ is the average of the replicates: 
$y_{gt} = [\Delta C_T(gt1) + \Delta C_T(gt2)]/2$. A detailed description of the qPCR data analysis 
is provided in Data Set S1. Software, data, and detailed expectation-maximization 
methods are available in the supplemental material 
(see Data Set S1). 
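
The replicate-averaged quantity defined above can be computed as in the following Python sketch. The function names and cycle-threshold values are hypothetical and serve only to illustrate the arithmetic; the full analysis, including the expectation-maximization step, is described in the supplemental material.

```python
def delta_ct(ct_mutant: float, ct_sigA: float) -> float:
    """Cycle-threshold difference between a Tn mutant and sigA."""
    return ct_mutant - ct_sigA

def replicate_average(ct_mutant_reps, ct_sigA_reps) -> float:
    """Average the delta-C_T values over technical replicates.

    Each argument holds one C_T value per replicate (two in the study)
    for a single mutant at a single time point.
    """
    deltas = [delta_ct(g, h) for g, h in zip(ct_mutant_reps, ct_sigA_reps)]
    return sum(deltas) / len(deltas)

# Hypothetical C_T values for one mutant at one time point.
y_gt = replicate_average([24.0, 24.4], [18.0, 18.2])
```

Because a higher cycle threshold means less template, an increase in this value over time indicates that the mutant is becoming less abundant relative to the sigA reference.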

New predictions based on additional experimental data sets and 
new experimental results. We collected essentiality data sets from four 
papers published after the initial selection of candidates for testing (28- 
31). Three of these new data sets rely on improved experimental methods 
using next-generation sequencing to identify TA sites lacking transposon 
insertions. Different methods characterize essential genes based on the 
number of consecutive TA sites without observed insertions (29) or iden- 
tify overlapping genome regions lacking transposon insertions and then 
identify genes overlapping these essential regions (29, 31). New Bayesian 
methods using extreme value distributions to describe runs of TA sites 
have also been applied to estimate posterior probabilities of essentiality 
for each gene (28, 29). In addition to these experimental approaches, a 
recent computational method employed a metabolic reconstruction and 
flux balance analysis (FBA) to identify essential metabolic genes (30). 
These data sets generally identify 700 genes overall as essential, of which 
about 200 are metabolic (see Data Set S5A). These four data sets were 
incorporated as additional essential gene features and propagated through 
the biological networks using graph diffusion kernels. 
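
The run-length statistic underlying several of these methods can be sketched in a few lines of Python. The example computes only the longest run of consecutive TA sites without insertions within one gene; deciding whether such a run is long enough to indicate essentiality requires the statistical models of the cited studies, which are not reproduced here. The function name and insertion counts are hypothetical.

```python
def longest_empty_run(insertion_counts):
    """Length of the longest run of consecutive TA sites with no insertions.

    insertion_counts: observed transposon insertion counts at the TA
    sites of one gene, listed in genomic order.
    """
    longest = current = 0
    for count in insertion_counts:
        if count == 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # an observed insertion breaks the run
    return longest

# Hypothetical insertion counts at seven TA sites of one gene.
example_run = longest_empty_run([3, 0, 0, 0, 1, 0, 0])
```

A gene whose TA sites are almost all empty yields a run close to its total number of TA sites, whereas a dispensable gene in a saturated library yields short runs.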

New predictions also relied on updated "attenuated" and "nonattenu- 
ated" gene labels according to the new results for mutants tested experi- 
mentally. Mutants found to be virulent were labeled as nonattenuated. 

We generated new predictions in two stages: first, we included just the 
new external data; then, we also included the updated gene labels. These 
predictions used the same methods as described for the original set of 
predictions used to prioritize genes for testing (see Data Set S6). 

SUPPLEMENTAL MATERIAL 

Supplemental material for this article may be found at http://mbio.asm.org/lookup/suppl/doi:10.1128/mBio.01066-13/-/DCSupplemental. 

Data Set S1, DOCX file, 0 MB. 

Data Set S2, XLSX file, 0.9 MB. 

Data Set S3, PPTX file, 0.1 MB. 

Data Set S4, DOCX file, 0.2 MB. 

Data Set S5, DOCX file, 0 MB. 

Data Set S6, XLSX file, 0.3 MB. 